Vocabulary Selection for a Broadcast News Transcription System using a Morpho-syntactic Approach

Authors

  • Ciro Martins
  • António Teixeira
  • João Neto
Abstract

Although the vocabularies of ASR systems are designed to achieve high coverage for the expected domain, out-of-vocabulary (OOV) words cannot be avoided. In particular, for daily and real-time transcription of Broadcast News (BN) data in highly inflected languages, rapid vocabulary growth leads to high OOV word rates. To overcome this problem, we present a new morpho-syntactic approach that dynamically selects the target vocabulary for this particular domain by trading off OOV word rate against vocabulary size. We evaluate this approach against the common selection strategy based on word frequency. Experiments were carried out on a European Portuguese BN transcription system. Results computed on seven news shows yield a relative reduction of 37.8% in OOV word rate against the baseline system and 5.5% compared with the common word-frequency approach.
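To make the quantities in the abstract concrete, the following is a minimal sketch of the word-frequency selection strategy used as the comparison point, together with the OOV word rate and relative reduction measures. It does not implement the morpho-syntactic approach itself, which the abstract does not describe in detail. The function names and the toy Portuguese sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def select_vocabulary(training_tokens, size):
    """Frequency-based selection: keep the `size` most frequent word types."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(vocabulary, test_tokens):
    """Fraction of running words in the test data missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


def relative_reduction(baseline_rate, new_rate):
    """Relative reduction of `new_rate` with respect to `baseline_rate`."""
    return (baseline_rate - new_rate) / baseline_rate


# Toy data (hypothetical, for illustration only).
train = "o governo anunciou hoje que o governo vai rever o plano".split()
test = "o governo anunciou ontem o novo plano".split()

small_vocab = select_vocabulary(train, 2)   # the two most frequent types
large_vocab = select_vocabulary(train, 8)   # every type seen in training

small_rate = oov_rate(small_vocab, test)
large_rate = oov_rate(large_vocab, test)
print(f"OOV rate, 2-word vocabulary: {small_rate:.3f}")
print(f"OOV rate, 8-word vocabulary: {large_rate:.3f}")
print(f"Relative reduction: {relative_reduction(small_rate, large_rate):.1%}")
```

A larger vocabulary lowers the OOV word rate but never removes words that were absent from the training text, which is the trade-off between OOV word rate and vocabulary size that the abstract refers to.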

Similar resources

Vocabulary selection for a broadcast news transcription system using a morpho-syntactic approach

Although the vocabularies of ASR systems are designed to achieve high coverage for the expected domain, out-of-vocabulary (OOV) words cannot be avoided. Particularly, for daily and real-time transcription of Broadcast News (BN) data in highly inflected languages, the rapid vocabulary growth leads to high OOV word rates. To overcome this problem, we present a new morpho-syntactic approach to dynam...

Automatic estimation of language model parameters for unseen words using morpho-syntactic contextual information

Various information sources naturally contain new words that appear on a daily basis and are not present in the vocabulary of the speech recognition system, but are important for applications such as closed-captioning or information dissemination. To be recognized, those words need to be included in the vocabulary and the language model (LM) parameters updated. In this context, we propose...

Title generation for spoken broadcast news using a training corpus

The problem of title generation involves finding the essence of a document and expressing it in only a few words. The results of a query to the Informedia Digital Video Library are summarized through an automatically generated title for each retrieved news story. When the document is errorful, as with speech-recognized broadcast news stories, the title creation challenge becomes even greater. W...

Domain Adaptation of a Broadcast News Transcription System for the Portuguese Parliament

The main goal of this work is the adaptation of a broadcast news transcription system to a new domain, namely, the Portuguese Parliament plenary meetings. This paper describes the different domain adaptation steps that lowered our baseline absolute word error rate from 20.1% to 16.1%. These steps include the vocabulary selection, in order to include specific domain terms, language model adaptat...

Dynamic language modeling for European Portuguese

This paper reports on the work done on vocabulary and language model daily adaptation for a European Portuguese broadcast news transcription system. The proposed adaptation framework takes into consideration European Portuguese language characteristics, such as its high level of inflection and complex verbal system. A multi-pass speech recognition framework using contemporary written texts avai...

Publication date: 2007